Translated by Google Translate
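Modern deep learning (DL) architectures are trained with variants of the SGD algorithm run under a $\textit{manually}$ defined learning rate schedule, i.e., the learning rate is dropped at predefined epochs, typically when the training loss is expected to saturate. In this paper, we develop an algorithm that performs the learning rate drop $\textit{automatically}$. The proposed method, which we call AutoDrop, is motivated by the observation that the angular velocity of the model parameters, i.e., the rate at which the convergence direction changes, initially increases rapidly for a fixed learning rate and then progresses toward soft saturation. At saturation the optimizer slows down, so saturation of the angular velocity is a good indicator for dropping the learning rate. After the drop, the angular velocity "resets" and follows the pattern described above: it increases again until it saturates. We show that our method improves over SOTA training approaches: it accelerates the training of DL models and leads to better generalization. We also show that our method does not require any extra hyperparameter tuning. AutoDrop is furthermore extremely simple to implement and computationally cheap. Finally, we develop a theoretical framework for analyzing our algorithm and provide convergence guarantees.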
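The drop rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: angular velocity is taken here to be the angle between two consecutive parameter displacement vectors, and the saturation test (`is_saturated`, with its `window` and `tol` parameters) and the drop `factor` are hypothetical choices that the abstract does not specify.

```python
import numpy as np

def angular_velocity(prev_step, curr_step):
    """Angle in radians between two consecutive parameter displacements.

    Assumption: this is one plausible way to measure how fast the
    convergence direction changes; the abstract gives no formula.
    """
    prev_step = np.asarray(prev_step, dtype=float)
    curr_step = np.asarray(curr_step, dtype=float)
    denom = np.linalg.norm(prev_step) * np.linalg.norm(curr_step) + 1e-12
    cos = np.dot(prev_step, curr_step) / denom
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def is_saturated(history, window=3, tol=1e-2):
    """Hypothetical saturation test: the angular velocity grew by less
    than `tol` over the last `window` measurements."""
    if len(history) <= window:
        return False
    return (history[-1] - history[-1 - window]) < tol

def maybe_drop(lr, history, factor=0.1, window=3, tol=1e-2):
    """Drop the learning rate once the angular velocity saturates.

    Returns the (possibly reduced) learning rate and the history to keep.
    The history is cleared after a drop, mirroring the "reset" behaviour.
    """
    if is_saturated(history, window, tol):
        return lr * factor, []
    return lr, history
```

In a training loop, one would append `angular_velocity(...)` to `history` once per epoch and then call `maybe_drop` to decide whether to reduce the learning rate.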
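We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm whose parameter updates rely on two forces: a regular gradient step and a corrective direction dictated by the currently best-performing worker (the leader). Our method differs from the parameter-averaging scheme EASGD in several ways: (i) our objective formulation does not change the location of stationary points relative to the original optimization problem; (ii) we avoid the convergence slowdown caused by pulling local workers that are descending into different local minima toward each other (i.e., toward the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient, since it broadcasts only the parameters of the leader rather than those of all workers. We provide a theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and of its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, in which we form groups of workers, each represented by its own local leader (the best performer in the group), and update each worker with a corrective direction composed of two attractive forces: one toward the local leader and one toward the global leader (the best performer among all workers). The multi-leader setting is well aligned with current hardware architectures, where the local workers forming a group reside within a single computational node and different groups correspond to different nodes. For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines.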
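The two forces in the update above can be illustrated with a single synchronous step. This is a minimal sketch under assumptions rather than the authors' algorithm: the pull toward the leader is modelled as a quadratic penalty with a hypothetical strength `pull`, the leader is chosen by evaluating a user-supplied `loss_fn`, and the function name `lsgd_step` is invented for illustration.

```python
import numpy as np

def lsgd_step(workers, loss_fn, grad_fn, lr=0.1, pull=0.1):
    """One synchronous leader-guided step for a list of worker parameters.

    Each worker takes a regular gradient step plus a corrective step
    toward the current leader (the worker with the lowest loss).
    Assumption: the corrective force is lr * pull * (x - leader).
    """
    losses = [loss_fn(x) for x in workers]
    leader = workers[int(np.argmin(losses))].copy()
    # Only the leader's parameters need to be broadcast to the workers.
    return [x - lr * grad_fn(x) - lr * pull * (x - leader) for x in workers]

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
workers = [np.array([1.0, 0.0]), np.array([4.0, 0.0])]
workers = lsgd_step(workers, lambda x: 0.5 * float(x @ x), lambda x: x)
```

The leader itself feels no corrective force, because its distance to the leader is zero, so it performs a plain gradient step.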
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and, under certain assumptions, show improved generalization over SGD using uniform stability. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
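The nested-loop structure described above can be sketched as follows. This is an illustrative approximation under assumptions, not the published implementation: the inner loop runs noisy gradient steps (Langevin dynamics) on the loss plus a proximity term with a hypothetical coupling `gamma`, keeps an exponential average `mu` of the inner iterates, and the outer step moves the weights toward that average. The function name and all default values are invented for the example.

```python
import numpy as np

def entropy_sgd_step(x, grad_fn, lr=0.1, inner_lr=0.1, gamma=1.0,
                     inner_steps=5, alpha=0.75, noise_scale=1e-3, rng=None):
    """One outer update with an inner Langevin loop (sketch).

    x        : current weights (NumPy array)
    grad_fn  : returns the (stochastic) gradient of the loss at a point
    gamma    : assumed strength of the pull keeping the inner iterate near x
    """
    rng = np.random.default_rng() if rng is None else rng
    x_prime = x.copy()   # inner-loop iterate, started at the current weights
    mu = x.copy()        # running average of the inner iterates
    for _ in range(inner_steps):
        # Gradient of the loss plus the proximity term (gamma / 2) * ||x - x'||^2.
        drift = grad_fn(x_prime) - gamma * (x - x_prime)
        noise = noise_scale * np.sqrt(inner_lr) * rng.standard_normal(x.shape)
        x_prime = x_prime - inner_lr * drift + noise
        mu = (1.0 - alpha) * mu + alpha * x_prime
    # Outer step: gamma * (x - mu) estimates the local-entropy gradient.
    return x - lr * gamma * (x - mu)
```

On a convex toy loss such as f(x) = 0.5 * ||x||^2 with the noise switched off, each call moves the weights toward the minimum, which is a quick sanity check for the sign conventions.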
We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits behavior similar to that of the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality as measured by the test error. This emphasizes a major difference between large- and small-size networks, where for the latter poor-quality local minima have a nonzero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant, as the global minimum often leads to overfitting.
Novel topological spin textures, such as magnetic skyrmions, benefit from their inherent stability, acting as the ground state in several magnetic systems. In the current study of atomic monolayer magnetic materials, reasonable initial guesses are still needed to search for those magnetic patterns. This situation underlines the need to develop a more effective way to identify the ground states. To solve this problem, in this work, we propose a genetic-tunneling-driven variance-controlled optimization approach, which combines a local energy minimizer back-end and a metaheuristic global searching front-end. This algorithm is an effective optimization solution for searching for magnetic ground states at extremely low temperatures and is also robust for finding low-energy degenerate states at finite temperatures. We demonstrate here the success of this method in searching for magnetic ground states of 2D monolayer systems with both artificial interactions and interactions calculated from density functional theory. It is also worth noting that the inherently concurrent nature of this algorithm can significantly decrease the execution time. In conclusion, our proposed method provides a useful tool for energy optimization of low-dimensional magnetic systems.
The release of ChatGPT, a language model capable of generating text that appears human-like and authentic, has gained significant attention beyond the research community. We expect that the convincing performance of ChatGPT incentivizes users to apply it to a variety of downstream tasks, including prompting the model to simplify their own medical reports. To investigate this phenomenon, we conducted an exploratory case study. In a questionnaire, we asked 15 radiologists to assess the quality of radiology reports simplified by ChatGPT. Most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient. Nevertheless, instances of incorrect statements, missed key medical findings, and potentially harmful passages were reported. While further studies are needed, the initial insights of this study indicate a great potential in using large language models like ChatGPT to improve patient-centered care in radiology and other medical domains.
Efficient surrogate modelling is a key requirement for uncertainty quantification in data-driven scenarios. In this work, a novel approach of using Sparse Random Features for surrogate modelling in combination with self-supervised dimensionality reduction is described. The method is compared to other methods on synthetic and real data obtained from crashworthiness analyses. The results show that the approach described here outperforms the state-of-the-art surrogate modelling techniques Polynomial Chaos Expansions and Neural Networks.